
Conversation

@kryanbeane
Contributor

@kryanbeane kryanbeane commented Aug 27, 2025

Issue link

RHOAIENG-30720

What changes have been made

Removed GCS FT from the lifecycled RayJob implementation. I will discuss this with the team, but my reasoning is as follows:

If GCS FT is enabled with a RayJob as the entrypoint and the GCS fails, the following happens:

  • The RayCluster is no longer in the RUNNING state, so the RayJob is set to FAILED by the KubeRay Operator
  • This tells the KubeRay Operator to delete the RayCluster

The only case where a RayJob will retry is if backoffLimit is set in the job spec. That field tells KubeRay how many times to spin up a new RayCluster for the job; it does not try to recover the same RayCluster, which makes GCS FT useless in this scenario.
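For illustration, a minimal sketch of a RayJob manifest with a retry budget, assuming the upstream KubeRay v1 API (`backoffLimit` and `rayClusterSpec` are KubeRay fields; the image and script path are placeholders, not values from this PR):

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: sample-rayjob
spec:
  entrypoint: python /home/ray/scripts/job.py   # placeholder script
  # Retry budget: each retry provisions a NEW RayCluster for the job;
  # KubeRay does not attempt to recover the failed cluster's GCS state.
  backoffLimit: 2
  rayClusterSpec:
    rayVersion: "2.47.1"
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: <ray-image>   # placeholder
```

Because every retry is a fresh cluster, any GCS state checkpointed for fault tolerance is never read back, which is the core of the argument above.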

In the long-lived RayCluster scenario, if GCS fails and FT is enabled, the same RayCluster restarts rather than a new one being created, so GCS FT does provide value there.
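For context, a hedged sketch of what enabling GCS FT on a long-lived RayCluster looks like in recent KubeRay releases (the `gcsFaultToleranceOptions` field and the Redis address follow the upstream v1 API; all values are placeholders):

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: long-lived-cluster
spec:
  # GCS state is checkpointed to an external Redis, so a restarted head
  # pod can recover the SAME cluster instead of starting from scratch.
  gcsFaultToleranceOptions:
    redisAddress: redis.default.svc.cluster.local:6379   # placeholder
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: <ray-image>   # placeholder
```

This is the case the follow-up PR mentioned below is meant to address.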

I will include a second PR to fix GCS FT for long-lived RayClusters.

Note: We could propose an upstream issue to get GCS FT working for this case. It would require the KubeRay Operator to recognise that GCS FT is enabled for the lifecycled RayCluster and restart that cluster instead of spinning up a new one. As far as I could see, this logic doesn't exist yet, so we could open an issue for it.

Verification steps

N/A

@openshift-ci-robot
Collaborator

openshift-ci-robot commented Aug 27, 2025

@kryanbeane: This pull request references RHOAIENG-30720 which is a valid jira issue.

In response to this:

Issue link

RHOAIENG-30720

What changes have been made

Removed GCS FT from the lifecycled RayJob implementation. I will discuss this with the team, but my reasoning is as follows:

If GCS FT was enabled and working with RayJobs as the entrypoint:

  • GCS fails for some reason
  • The RayCluster is no longer in the RUNNING state, so the RayJob is set to FAILED
  • This tells the KubeRay Operator to delete the RayCluster

The only case where a RayJob will retry is if backoffLimit is set in the job spec. That field tells KubeRay how many times to spin up a new RayCluster for the job; it does not try to recover the same RayCluster, which makes GCS FT useless in this scenario.

In the long-lived RayCluster scenario, if GCS fails and FT is enabled, the same RayCluster restarts rather than a new one being created, so GCS FT does provide value there.

I will include a second PR to fix GCS FT for long-lived RayClusters.

Verification steps

N/A

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from dimakis August 27, 2025 17:59
@codecov

codecov bot commented Aug 27, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.48%. Comparing base (cb5589c) to head (1f2241e).
⚠️ Report is 1 commit behind head on ray-jobs-feature.

Additional details and impacted files
@@                 Coverage Diff                  @@
##           ray-jobs-feature     #892      +/-   ##
====================================================
+ Coverage             93.45%   93.48%   +0.03%     
====================================================
  Files                    21       21              
  Lines                  1910     1889      -21     
====================================================
- Hits                   1785     1766      -19     
+ Misses                  125      123       -2     



@kryanbeane kryanbeane force-pushed the head-pod-persistance-fix branch 6 times, most recently from b105e4c to fd045a5 Compare August 28, 2025 16:47
RAY_VERSION = "2.47.1"
# Below references ray:2.47.1-py311-cu121
CUDA_RUNTIME_IMAGE = "quay.io/modh/ray@sha256:6d076aeb38ab3c34a6a2ef0f58dc667089aa15826fa08a73273c629333e12f1e"
MOUNT_PATH = "/home/ray/scripts"
Contributor Author


While I'm here, I'm also moving the new mount path to a constant, since it's referenced in so many places.
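As a hypothetical illustration of how a module-level `MOUNT_PATH` constant keeps the many call sites consistent (the helper names below are invented for this sketch, not taken from the PR):

```python
# Single source of truth for where job scripts are mounted in the Ray pods.
MOUNT_PATH = "/home/ray/scripts"


def script_volume_mount(name: str = "job-scripts") -> dict:
    """Build a Kubernetes volumeMount entry pointing at the shared script path."""
    return {"name": name, "mountPath": MOUNT_PATH}


def entrypoint_for(script: str) -> str:
    """Compose a Ray job entrypoint command from the mounted script location."""
    return f"python {MOUNT_PATH}/{script}"


print(entrypoint_for("train.py"))  # python /home/ray/scripts/train.py
```

If the mount path ever changes, only the constant needs updating rather than every place a volume mount or entrypoint is built.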

@kryanbeane kryanbeane force-pushed the head-pod-persistance-fix branch from 2c23337 to 1f2241e Compare September 1, 2025 10:37
Contributor

@LilyLinh LilyLinh left a comment


lgtm. Great work! Thanks Bryan! :)
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 1, 2025
@pawelpaszki
Contributor

looks good to me. ran a sample test against ROSA cluster successfully. Feel free to merge

@openshift-ci
Contributor

openshift-ci bot commented Sep 3, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: LilyLinh, pawelpaszki

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 3, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit 416ba8d into project-codeflare:ray-jobs-feature Sep 3, 2025
10 checks passed